[Logging]metrics ops logs include event loop delay histogram data #120451

TinaHeiligers · 2021-12-05T20:23:23Z

Resolves #119069

Adds 50th, 95th & 99th percentiles (in ms) of the event loop delay histogram to metrics.ops logs
Ensures the event loop delay histogram data is provided by the /api/status API to allow Monitoring UI to render more detailed metrics.

Logs

debug logs

pattern layout: "[%date][%level][%logger][%meta]---%message"

[2021-12-05T14:02:52.138-07:00][DEBUG][metrics.ops][{"event":{"kind":"metric","category":["process","host"],"type":["info"]},"process":{"uptime":37,"memory":{"heap":{"usedInBytes":494344840}},"eventLoopDelay":11.621535925407926,"eventLoopDelayHistogram":{"50":11.550719,"95":12.623871,"99":12.664831}},"host":{"os":{"load":{"1m":3.7490234375,"5m":8.82666015625,"15m":8.89697265625}}}}]---memory: 471.4MB uptime: 0:00:37 load: [3.75,8.83,8.90] mean delay: 11.622 delay histogram: { 50: 11.551; 95: 12.624; 99: 12.665 }
{"event":{"kind":"metric","category":["process","host"],"type":["info"]},"process":{"uptime":37,"memory":{"heap":{"usedInBytes":494344840}},"eventLoopDelay":11.621535925407926,"eventLoopDelayHistogram":{"50":11.550719,"95":12.623871,"99":12.664831},"pid":4989},"host":{"os":{"load":{"1m":3.7490234375,"5m":8.82666015625,"15m":8.89697265625}}},"ecs":{"version":"8.0.0"},"@timestamp":"2021-12-05T14:02:52.138-07:00","message":"memory: 471.4MB uptime: 0:00:37 load: [3.75,8.83,8.90] mean delay: 11.622 delay histogram: { 50: 11.551; 95: 12.624; 99: 12.665 }","log":{"level":"DEBUG","logger":"metrics.ops"},"trace":{"id":"8bcb3247c71a5ae90736d39169f9ab91"},"transaction":{"id":"9b3566667b7effed"}}

json layout (also shown prettified below):

{"event":{"kind":"metric","category":["process","host"],"type":["info"]},"process":{"uptime":146,"memory":{"heap":{"usedInBytes":515810168}},"eventLoopDelay":11.739656432941176,"eventLoopDelayHistogram":{"50":11.829247,"95":12.615679,"99":12.664831},"pid":5399},"host":{"os":{"load":{"1m":1.939453125,"5m":2.697265625,"15m":4.89892578125}}},"ecs":{"version":"8.0.0"},"@timestamp":"2021-12-05T14:15:53.772-07:00","message":"memory: 491.9MB uptime: 0:02:26 load: [1.94,2.70,4.90] mean delay: 11.740 delay histogram: { 50: 11.829; 95: 12.616; 99: 12.665 }","log":{"level":"DEBUG","logger":"metrics.ops"},"trace":{"id":"cd6d60eb7fcc200e8d0915b2e91c914e"},"transaction":{"id":"e68d05db97f9244c"}}

json layout prettified

{
  "event": {
    "kind": "metric",
    "category": [
      "process",
      "host"
    ],
    "type": [
      "info"
    ]
  },
  "process": {
    "uptime": 146,
    "memory": {
      "heap": {
        "usedInBytes": 515810168
      }
    },
    "eventLoopDelay": 11.739656432941176,
    "eventLoopDelayHistogram": {
      "50": 11.829247,
      "95": 12.615679,
      "99": 12.664831
    },
    "pid": 5399
  },
  "host": {...}
  },
  "ecs": {
    "version": "8.0.0"
  },
  "@timestamp": "2021-12-05T14:15:53.772-07:00",
  "message": "memory: 491.9MB uptime: 0:02:26 load: [1.94,2.70,4.90] mean delay: 11.740 delay histogram: { 50: 11.829; 95: 12.616; 99: 12.665 }",
  "log": {
    "level": "DEBUG",
    "logger": "metrics.ops"
  },
  "trace": {
    "id": "cd6d60eb7fcc200e8d0915b2e91c914e"
  },
  "transaction": {
    "id": "e68d05db97f9244c"
  }
}

`/api/status`

network response body

{
  "name": "christianes-mbp.lan",
  "uuid": "5b2de169-2785-441b-ae8c-186a1936b17d",
  "version": {
    "number": "8.1.0",
    "build_hash": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "build_number": 9007199254740991,
    "build_snapshot": false
  },
  "status": {...},
  "metrics": {
    "last_updated": "2021-12-05T21:38:23.532Z",
    "collection_interval_in_millis": 5000,
    "os": {...},
    "process": {
      "memory": {...},
      "pid": 5830,
      "event_loop_delay": 12.21699262745098,
      "event_loop_delay_histogram": {
        "min": 9.076736,
        "max": 70.385663,
        "mean": 12.21699262745098,
        "exceeds": 0,
        "stddev": 5.6599237565600395,
        "fromTimestamp": "2021-12-05T21:38:18.533Z",
        "lastUpdatedAt": "2021-12-05T21:38:23.528Z",
        "percentiles": {
          "50": 11.116543,
          "75": 12.599295,
          "95": 16.793599,
          "99": 41.287679
        }
      },
      "uptime_in_millis": 966734.90539
    },
    "processes": [
      {
        "memory": {...},
        "pid": 5830,
        "event_loop_delay": 12.21699262745098,
        "event_loop_delay_histogram": {
          "min": 9.076736,
          "max": 70.385663,
          "mean": 12.21699262745098,
          "exceeds": 0,
          "stddev": 5.6599237565600395,
          "fromTimestamp": "2021-12-05T21:38:18.533Z",
          "lastUpdatedAt": "2021-12-05T21:38:23.528Z",
          "percentiles": {
            "50": 11.116543,
            "75": 12.599295,
            "95": 16.793599,
            "99": 41.287679
          }
        },
        "uptime_in_millis": 966734.90539
      }
    ],
    "response_times": {...},
    "concurrent_connections": 1,
    "requests": {...},
      "status_codes": {
        "200": 1,
        "304": 122
      }
    }
  }
}

Checklist

Delete any items that are not applicable to this PR.

Unit or functional tests were updated or added to match the most common scenarios

How to test this:

Pull this branch and run kibana with trial license and the following logging configuration:

// in kibana.yml
logging:
  appenders:
    console:
      type: console
      layout:
        type: pattern
        highlight: true
        pattern: "[%date][%level][%logger][%meta]---%message"
  root:
    appenders: [console]
    level: warn
  loggers:
    - name: metrics.ops
      appenders: [console]
      level: debug
    - name: monitoring
      level: debug
      appenders: [console]

After kibana starts up, wait a few moments to see the output of the metrics.ops logs. Observe that the eventLoopDelayHistogram appears in the log message and under the process field in the log meta.

Navigate to the status page: localhost:5601/<basepath>/status.
Open the browsers' dev tools and inspect the network response from the status api endpoint. Observe that the eventLoopDelay histogram percentiles appears under the metric/process and metric/processes keys.

Risk Matrix

Risk	Probability	Severity	Mitigation/Notes
Ingesting metrics ops logs and relying on the `metrics.ops` message to parse data is risky as the message string has changed	Low	Low	Relying on a particular log message is flaky, use the log meta instead. This PR changes the wording, and adds another entry.

Release note

Adds 50th, 95th and 99th event loop delay histogram percentiles to metrics.ops logger message and meta field surfaced when logging in debug mode or in json layout.

…s ops logs

TinaHeiligers · 2021-12-05T22:18:41Z

src/core/server/metrics/logging/get_ops_metrics_log.test.ts

-    const logMeta = getEcsOpsMetricsLog(createBaseOpsMetrics());
-    expect(logMeta).toMatchInlineSnapshot(`
+    const logMeta = getEcsOpsMetricsLog(createMockOpsMetrics(testMetrics));
+    expect(logMeta.meta).toMatchInlineSnapshot(`


The snapshot was getting too big to easily notice a change, so I've changed this assertion to only test the meta field.

TinaHeiligers · 2021-12-05T22:20:19Z

src/core/server/metrics/logging/get_ops_metrics_log.ts

  const eventLoopDelayVal = process?.event_loop_delay;
  const eventLoopDelayValMsg = eventLoopDelayVal
-    ? `delay: ${numeral(process?.event_loop_delay).format('0.000')}`
+    ? `mean delay: ${numeral(process?.event_loop_delay).format('0.000')}`


Now that we have the histogram percentiles in the log message, we need to let folks know that the first delay entry is the mean.

nit: I wonder if it would be more readable to combine the values in the existing delay, similar to how it is structured in the API?

"memory: 100.0B uptime: 0:00:01 load: [10.00,20.00,30.00] delay: { mean: 50.000, 50: 50.000, 95: 95.000, 99: 99.000 }"

(Edit: Mostly just trying to think how we can keep this as succinct as possible)

I wonder if it would be more readable to combine the values in the existing delay, similar to how it is structured in the API.

We could but that would be a breaking change for folks ingesting the log message. ATM, if someone was parsing the message string they might expect delay: number and end up with delay: [Object].

Admittedly, this PR also changes the message string but keeps the delay type the same. Appending the additional data seems like a less risky option here.

Admittedly, ingesting log message data isn't the best approach and folks should be using the meta instead. However, we removed meta from all the logs expect for debug mode so the extra info wouldn't be available for other levels.

We could but that would be a breaking change for folks ingesting the log message.

~~Yeah, although I suppose we are adding mean delay so a grok pattern against just delay could match two results now (both mean and histogram).~~ EDIT: Actually I guess not -- I misread, we have mean delay and delay histogram 🙂

However, in the past we have not considered the message format to be stable, and have made breaking changes to it in minors anyway. So I am not super concerned about this either way.

TinaHeiligers

self review

elasticmachine · 2021-12-06T00:31:47Z

Pinging @elastic/kibana-core (Team:Core)

pgayvallet

LGTM

Are we planning on updating the /status page to display these additional values btw?

pgayvallet · 2021-12-07T08:57:58Z

src/core/server/metrics/logging/get_ops_metrics_log.ts

@@ -75,7 +95,7 @@ export function getEcsOpsMetricsLog(metrics: OpsMetrics) {
  };

  return {
-    message: `${processMemoryUsedInBytesMsg}${uptimeValMsg}${loadValsMsg}${eventLoopDelayValMsg}`,
+    message: `${processMemoryUsedInBytesMsg}${uptimeValMsg}${loadValsMsg}${eventLoopDelayValMsg}${eventLoopDelayHistMsg}`,


NIT/optional: I wonder if

message: [ processMemoryUsedInBytesMsg, uptimeValMsg, loadValsMsg, eventLoopDelayValMsg, eventLoopDelayHistMsg, ].join('');

isn't more readable

Are we planning on updating the /status page to display these additional values btw?

It's not in the scope for this issue but I don't see why not!
How would we display the data? We could use the same cards we use now and add each entry individually or perhaps we could add another level to the metrics cards with details that would expand if a "show more" or "details" was clicked?
Maybe a simpler option would be to only show say the 50th, 95th and 99th percentiles in the same delay card, at the bottom in smaller text?

Update: I created #120667 for discussion on the approach.

lukeelmers

Are we planning on updating the /status page to display these additional values btw?

++ This would be a nice addition

lukeelmers · 2021-12-06T21:39:49Z

src/core/server/metrics/logging/get_ops_metrics_log.ts

  const eventLoopDelayVal = process?.event_loop_delay;
  const eventLoopDelayValMsg = eventLoopDelayVal
-    ? `delay: ${numeral(process?.event_loop_delay).format('0.000')}`
+    ? `mean delay: ${numeral(process?.event_loop_delay).format('0.000')}`


nit: I wonder if it would be more readable to combine the values in the existing delay, similar to how it is structured in the API?

"memory: 100.0B uptime: 0:00:01 load: [10.00,20.00,30.00] delay: { mean: 50.000, 50: 50.000, 95: 95.000, 99: 99.000 }"

(Edit: Mostly just trying to think how we can keep this as succinct as possible)

kibana-ci · 2021-12-07T19:38:30Z

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

💚 Build #11170 succeeded 01135bd
💚 Build #11169 succeeded 6f8f63f

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

…astic#120451)

kibanamachine · 2021-12-08T15:09:39Z

💔 Backport failed

Status	Branch	Result
✅	7.16
❌	7.17	The branch "7.17" is invalid or doesn't exist

Successful backport PRs will be merged automatically after passing CI.

To backport manually run:
node scripts/backport --pr 120451

kibanamachine · 2021-12-09T15:44:59Z

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports run node scripts/backport --pr 120451 or prevent reminders by adding the backport:skip label.

kibanamachine · 2021-12-10T15:54:58Z

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports run node scripts/backport --pr 120451 or prevent reminders by adding the backport:skip label.

kibanamachine · 2021-12-13T16:04:58Z

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports run node scripts/backport --pr 120451 or prevent reminders by adding the backport:skip label.

…astic#120451)

kibanamachine · 2021-12-13T22:59:27Z

💔 Backport failed

Status	Branch	Result
✅	8.0
❌	7.17	The branch "7.17" is invalid or doesn't exist

Successful backport PRs will be merged automatically after passing CI.

To backport manually run:
node scripts/backport --pr 120451

…20451) (#121145) Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>

…astic#120451)

…20451) (#121430)

…astic#120451)

Adds event loop delay histogram metrics (50th, 95th & 99th) to metric…

2ff3fb6

…s ops logs

TinaHeiligers added release_note:enhancement Feature:Logging v8.1.0 labels Dec 5, 2021

Removes unused

6f8f63f

elastic deleted a comment from kibana-ci Dec 5, 2021

restores plugins/.empty

01135bd

TinaHeiligers commented Dec 5, 2021

View reviewed changes

TinaHeiligers added the backport:skip This commit does not require backporting label Dec 5, 2021

TinaHeiligers marked this pull request as ready for review December 5, 2021 22:31

TinaHeiligers requested a review from a team as a code owner December 5, 2021 22:31

TinaHeiligers added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Dec 6, 2021

pgayvallet approved these changes Dec 7, 2021

View reviewed changes

lukeelmers approved these changes Dec 7, 2021

View reviewed changes

Improves ops metrics log message

8231dda

TinaHeiligers merged commit 80dc3b6 into elastic:main Dec 7, 2021

TinaHeiligers deleted the improve-event-loop-delay-representation branch December 7, 2021 20:04

TinaHeiligers added v7.16.1 v7.16.2 v7.17.0 auto-backport Deprecated - use backport:version if exact versions are needed and removed backport:skip This commit does not require backporting labels Dec 8, 2021

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Dec 8, 2021

[Logging]metrics ops logs include event loop delay histogram data (el…

aa972c1

…astic#120451)

kibanamachine mentioned this pull request Dec 8, 2021

[Logging]metrics ops logs include event loop delay histogram data (#120451) #120774

Closed

TinaHeiligers restored the improve-event-loop-delay-representation branch December 8, 2021 15:09

TinaHeiligers added v7.17.0 and removed v7.17.0 v7.16.1 v7.16.2 labels Dec 8, 2021

kibanamachine added the backport missing Added to PRs automatically when the are determined to be missing a backport. label Dec 9, 2021

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Dec 13, 2021

[Logging]metrics ops logs include event loop delay histogram data (el…

de777f7

…astic#120451)

kibanamachine mentioned this pull request Dec 13, 2021

[8.0] [Logging]metrics ops logs include event loop delay histogram data (#120451) #121145

Merged

kibanamachine added a commit that referenced this pull request Dec 14, 2021

[Logging]metrics ops logs include event loop delay histogram data (#1…

8e39e87

…20451) (#121145) Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>

TinaHeiligers added a commit to TinaHeiligers/kibana that referenced this pull request Dec 16, 2021

[Logging]metrics ops logs include event loop delay histogram data (el…

4e44ffb

…astic#120451)

TinaHeiligers mentioned this pull request Dec 16, 2021

[7.17] [Logging]metrics ops logs include event loop delay histogram data (#120451) #121430

Merged

TinaHeiligers added a commit that referenced this pull request Dec 16, 2021

[Logging]metrics ops logs include event loop delay histogram data (#1…

3474216

…20451) (#121430)

TinLe pushed a commit to TinLe/kibana that referenced this pull request Dec 22, 2021

[Logging]metrics ops logs include event loop delay histogram data (el…

de575eb

…astic#120451)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Logging]metrics ops logs include event loop delay histogram data #120451

[Logging]metrics ops logs include event loop delay histogram data #120451

TinaHeiligers commented Dec 5, 2021 •

edited

Loading

TinaHeiligers Dec 5, 2021

TinaHeiligers Dec 5, 2021 •

edited

Loading

lukeelmers Dec 6, 2021 •

edited

Loading

TinaHeiligers Dec 7, 2021

lukeelmers Dec 7, 2021 •

edited

Loading

TinaHeiligers left a comment

elasticmachine commented Dec 6, 2021

pgayvallet left a comment

pgayvallet Dec 7, 2021

TinaHeiligers Dec 7, 2021 •

edited

Loading

lukeelmers left a comment

lukeelmers Dec 6, 2021 •

edited

Loading

kibana-ci commented Dec 7, 2021

kibanamachine commented Dec 8, 2021

kibanamachine commented Dec 9, 2021

kibanamachine commented Dec 10, 2021

kibanamachine commented Dec 13, 2021

kibanamachine commented Dec 13, 2021

[Logging]metrics ops logs include event loop delay histogram data #120451

[Logging]metrics ops logs include event loop delay histogram data #120451

Conversation

TinaHeiligers commented Dec 5, 2021 • edited Loading

Logs

/api/status

Checklist

How to test this:

Risk Matrix

Release note

TinaHeiligers Dec 5, 2021

Choose a reason for hiding this comment

TinaHeiligers Dec 5, 2021 • edited Loading

Choose a reason for hiding this comment

lukeelmers Dec 6, 2021 • edited Loading

Choose a reason for hiding this comment

TinaHeiligers Dec 7, 2021

Choose a reason for hiding this comment

lukeelmers Dec 7, 2021 • edited Loading

Choose a reason for hiding this comment

TinaHeiligers left a comment

Choose a reason for hiding this comment

elasticmachine commented Dec 6, 2021

pgayvallet left a comment

Choose a reason for hiding this comment

pgayvallet Dec 7, 2021

Choose a reason for hiding this comment

TinaHeiligers Dec 7, 2021 • edited Loading

Choose a reason for hiding this comment

lukeelmers left a comment

Choose a reason for hiding this comment

lukeelmers Dec 6, 2021 • edited Loading

Choose a reason for hiding this comment

kibana-ci commented Dec 7, 2021

💚 Build Succeeded

Metrics [docs]

History

kibanamachine commented Dec 8, 2021

💔 Backport failed

kibanamachine commented Dec 9, 2021

kibanamachine commented Dec 10, 2021

kibanamachine commented Dec 13, 2021

kibanamachine commented Dec 13, 2021

💔 Backport failed

TinaHeiligers commented Dec 5, 2021 •

edited

Loading

`/api/status`

TinaHeiligers Dec 5, 2021 •

edited

Loading

lukeelmers Dec 6, 2021 •

edited

Loading

lukeelmers Dec 7, 2021 •

edited

Loading

TinaHeiligers Dec 7, 2021 •

edited

Loading

lukeelmers Dec 6, 2021 •

edited

Loading